Express Mail Label Number ER450357958US 



DATA RETRIEVAL METHOD, SYSTEM 
AND PROGRAM PRODUCT 

FIELD OF THE INVENTION 

The present invention relates to the retrieval and display of documents, and more particularly to a
data retrieval system, a data retrieval method, a program for causing a computer to execute data
retrieval, a computer readable storage medium storing the program, a graphical user interface
system for displaying retrieved documents, a program executed on a computer to implement the
graphical user interface, and a storage medium storing that program, in which a relatively small
number of documents can be efficiently retrieved from a large scale database comprising the
documents, and displayed.

BACKGROUND 

In recent years, with the progress of computer hardware, the amount of information to be processed
has steadily increased, and the databases storing that information have grown larger. This trend
has become more pronounced now that, in addition to the progress of computer hardware, network
technology allows necessary information to be acquired over the Internet using browser software.

Various methods have been proposed for detecting documents in a large scale database that may
include document data, image data and audio data. For example, Published Unexamined Patent
Application No. 2002-024268 by Kobayashi and others discloses a method and system for efficiently
detecting a group of a relatively small number of documents having the same or similar keywords
(hereinafter referred to as an outlier cluster) among the documents included in the database.
Likewise, in "Latent semantic space: iterative scaling improves precision of inter-document
similarity measurement", SIGIR 2000, pp. 216-223 and SIGIR 2001, pp. 152-154 by Kubota and others,
a method has been proposed for efficiently retrieving an outlier cluster by scaling a document
vector in a latent semantic space. Although various methods and systems for retrieving a group of
a small number of documents in the database as an outlier cluster have been proposed as described
above, they can be applied to a relatively small database configured by sampling, but are not
fully applicable to a larger database storing millions of documents, in terms of both retrieval
speed and detectability of the outlier cluster. Although the retrieval speed may improve to some
extent as computer performance is enhanced, the retrieval of the outlier cluster must be improved
separately by utilizing the linear-algebraic characteristics of the document keyword matrix.

Usually, the document data in a large scale database is digitized depending on whether or not each
registered keyword is contained, configured as a document keyword vector, and stored in the
database. The above method for retrieving the outlier cluster in the large scale database relies on
calculating a residual matrix generated by successively deleting the document vector having the
greatest norm. This successive calculation requires that the matrix successively generated using an
eigenvector or singular vector be stored in the main memory of the computer. That is, data having a
size of the number of documents × the number of attributes (keywords) must be stored in the main
memory. For example, where the number of documents is 100,000 and the number of keywords is 1000,
a storage capacity of 100,000 × 1000 × 8 bytes = 800 MB is needed to store the residual matrix as
double-precision real numbers. If the number of documents and the number of keywords increase,
generating the residual matrix requires storing an amount of data that exceeds what an ordinary
computer can hold. In this invention, the document keyword vector digitized based on the keyword
is simply referred to as the data.
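
The storage estimate above can be reproduced with a short calculation. The following Python sketch is illustrative only; the function name and its parameters are hypothetical and do not appear in the pseudo code of the invention.

```python
def residual_matrix_bytes(num_documents, num_keywords, bytes_per_value=8):
    """Bytes needed to hold a dense documents x keywords matrix.

    bytes_per_value=8 corresponds to a double-precision real number.
    """
    return num_documents * num_keywords * bytes_per_value


# The case given in the text: 100,000 documents and 1000 keywords.
size = residual_matrix_bytes(100_000, 1000)
print(size / 1_000_000, "MB")
```

Because the requirement grows with the product of the two counts, increasing both the number of documents and the number of keywords tenfold would raise it a hundredfold.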

On the other hand, various cluster retrieval techniques for application to information retrieval
or data mining have been offered so far. For example, such techniques are disclosed in Edie
Rasmussen, "Clustering Algorithms", Chapter 16, Information Retrieval, W. B. Frakes and R.
Baeza-Yates Eds., Prentice Hall (1992), and L. Kaufman and P. J. Rousseeuw, "Finding Groups in
Data", John Wiley & Sons (1990). Also, a method for automatically labeling the detected cluster
was disclosed in Alexandrin Popescul and Lyle H. Ungar, "Automatic Labeling of Document
Clusters", (2000). The simplest method labels the cluster of given documents with the word having
the greatest frequency of appearance.

Although the above method is simple, the resulting labels are insufficient and lose their meaning
when meaningless words appear frequently in the documents. In addition, there are a method for
labeling the cluster with the word most predictive of the cluster, instead of the most frequent
word, and a method for labeling the cluster with the title of the document nearest the center of
the multi-dimensional data constituting the cluster. However, labeling that reflects the
characteristics of the cluster is not always possible. Furthermore, there is a method for labeling
the cluster with the frequency information and the most predictive word by introducing a tree
structure in the documents, but introducing the tree structure is troublesome. The above methods
share the drawback that a cluster cannot be fully identified when the keyword used in labeling is
also contained in the data constituting another cluster.

In the above retrieval of the outlier cluster, to make the retrieved result more useful, it is
necessary that the outlier cluster be clearly distinguished from the major cluster, and that the
attributes (keywords) forming the outlier cluster be effectively presented to the user.

As described above, there is a need for a data retrieval method and system that solve the problem
of scalability associated with calculation of the residual matrix and improve the retrieval of the
outlier cluster. There is also a need for a data retrieval method and system that label the major
cluster and the outlier cluster in calculating each cluster and improve the identification of each
cluster. A still further need is for a graphical user interface system capable of making more
effective use of the retrieved results by efficiently presenting the attributes (keywords) of the
retrieved clusters to the user who has retrieved them.

SUMMARY

In order to solve the above-mentioned problems, in the present invention a contribution vector is
defined in connection with the inner product between the calculated basic vector and the document
vector in calculating a residual matrix, to enhance the retrieval of the outlier cluster. Using
this contribution vector, selective scaling is performed in calculating the residual vector,
whereby vectors classified as a so-called outlier cluster, which resides in the database at a
relatively low percentage, are retrieved more efficiently.

In this invention, a set of basic vectors is generated by making a set of calculated eigenvectors
orthogonal, and clusters are generated depending on the similarity between the generated basic
vectors and the data, whereby clustered retrieval results are created. In this case, the
similarity of the data included in each cluster is computed, and in computing the similarity the
keywords having higher weight are selected and stored in descending order, whereby a list
containing the identifier (data ID) of the data is created.

Also, in this invention, the number of data included in each generated cluster is determined, and
the configuration of clusters in the database is graphically presented to the user using an area
corresponding to the percentage of the database occupied by each cluster.

That is, the present invention provides a data retrieval system for causing a computer to retrieve
data stored in a database, the retrieval system comprising: a database storing data as a vector
digitized based on a keyword; a means for generating a residual vector from the data to compute a
covariance matrix and an eigenvector of the covariance matrix, and for generating and storing a
set of basic vectors from a set of the computed eigenvectors; a means for reading out the data and
at least one of the eigenvectors from a memory, for computing a contribution of the eigenvector to
the data, and for contracting or enlarging a residual vector and storing it; and a means for
selecting a keyword to be used for labeling according to a similarity between the stored basic
vector and the data and a weight on the similarity, and for storing the keyword in a memory.

The data retrieval system according to the invention may comprise a means for making the basic
vectors orthogonal. In the data retrieval system according to the invention, the means for
selecting the keyword to be used for labeling and storing the keyword in the memory may further
comprise a means for determining the weight on the similarity for the keyword and a means for
storing a certain number of keywords in the memory in descending order of the weight.

This invention also provides a data retrieval method for causing a computer to retrieve data
stored in a database, the data retrieval method comprising the steps of: reading out data from a
database storing data as a vector digitized based on a keyword; computing and storing a covariance
matrix and an eigenvector of the covariance matrix, using the data; generating and storing a set
of basic vectors from a set of the computed eigenvectors; reading out the data and at least one of
the eigenvectors from a memory, and computing and storing a contribution of the eigenvector to the
data; computing a residual vector from the data and the eigenvector; and contracting or enlarging
the residual vector by reading out the contribution, to compute and store a new eigenvector.

According to the invention, there is provided a computer executable program for implementing a
data retrieval method for causing a computer to retrieve data stored in a database, the program
comprising the steps of: reading out data from a database storing data as a vector digitized based
on a keyword; computing and storing a covariance matrix and an eigenvector of the covariance
matrix, using the data; generating and storing a set of basic vectors from a set of the computed
eigenvectors; reading out the data and at least one of the eigenvectors from a memory, and
computing and storing a contribution of the eigenvector to the data; computing a residual vector
from the data and the eigenvector; and contracting or enlarging the residual vector by reading out
the contribution, to compute and store a new eigenvector.

Also, the invention provides a computer readable storage medium storing a computer executable
program for implementing a data retrieval method for causing a computer to retrieve data stored
in a database, the program comprising the steps of: reading out data from a database storing data
as a vector digitized based on a keyword; computing and storing a covariance matrix and an
eigenvector of the covariance matrix, using the data; generating and storing a set of basic vectors
from a set of the computed eigenvectors; reading out the data and at least one of the eigenvectors
from a memory, and computing and storing a contribution of the eigenvector to the data; computing
a residual vector from the data and the eigenvector; contracting or enlarging the residual vector
by reading out the contribution, to compute and store a new eigenvector; and generating and
storing a set of basic vectors from a set of computed eigenvectors.

According to the invention, there is provided a graphical user interface system for displaying the
computer retrieved data, the graphical user interface system comprising: a database storing data
as a vector digitized based on a keyword; a means for computing a basic vector from the data and
storing it in a memory; a means for classifying data into clusters depending on a similarity
between the stored basic vector and the data, for counting the number of data included in each
cluster, and for selecting a keyword to be used for labeling according to a weight on the
similarity, so as to store in a memory at least the number of data and the keyword as a pair; and
a means for displaying the clusters in a spiral in order of the number of data of each cluster,
and performing a different rendering processing for each adjacent cluster.

Also, this invention provides a program for enabling a computer to implement a graphical user
interface for displaying the computer retrieved data, the program comprising the steps of: reading
data from a database storing data as a vector digitized based on a keyword; computing a basic
vector from the data and storing it in a memory; classifying data into clusters depending on a
similarity between the stored basic vector and the data, counting the number of data included in
each cluster, and selecting a keyword to be used for labeling according to a weight on the
similarity, so as to store in a memory at least the number of data and the keyword as a pair; and
displaying the clusters in a spiral in order of the number of data of each cluster, and performing
a different rendering processing for each adjacent cluster.

Also, the invention provides a computer readable storage medium storing a program for enabling
a computer to implement a graphical user interface for displaying the computer retrieved data,
the program comprising the steps of: reading data from a database storing data as a vector
digitized based on a keyword; computing a basic vector from the data and storing it in a memory;
classifying data into clusters depending on a similarity between the stored basic vector and the
data, counting the number of data included in each cluster, and selecting a keyword to be used for
labeling according to a weight on the similarity, so as to store in a memory at least the number
of data and the keyword as a pair; and displaying the clusters in a spiral in order of the number
of data of each cluster, and performing a different rendering processing for each adjacent
cluster.

THE FIGURES 

Figure 1 is a schematic flowchart showing a data detecting method according to the present
invention;

Figure 2 is a flowchart showing another data detecting method according to the invention;

Figure 3 is a schematic flowchart of a selective scaling method according to the invention;

Figure 4 is a schematic pseudo code for the selective scaling method according to the invention;

Figure 5 is a schematic pseudo code for the selective scaling method according to the invention
(continued from Figure 4);

Figure 6 is a schematic view showing the relationship between residual vector and used basic
vector according to the invention;

Figure 7 is a schematic flowchart of a cluster classifying and labeling method of the invention;

Figure 8 is a list of keywords generated using the cluster classifying and labeling method of the
invention;

Figure 9 is a data structure created by the cluster classifying and labeling method of the
invention;

Figure 10 is a schematic pseudo code for the cluster classifying and labeling method of the
invention;

Figure 11 is a schematic pseudo code for the cluster classifying and labeling method of the
invention (continued);

Figure 12 is a schematic pseudo code for the cluster classifying and labeling method of the
invention (continued);

Figure 13 is an output list of the labeled cluster data in a CSV format in this invention;

Figure 14 is a schematic flowchart showing a display method for displaying the cluster to the
user in this invention;

Figure 15 is a schematic functional block diagram for a data retrieval system according to the
invention;

Figure 16 is a schematic functional block diagram for the data retrieval system according to
another embodiment of the invention;

Figure 17 is a table showing the configuration of a test database for use in this invention;

Figure 18 is a table for plotting the computed number of data of cluster to the number of basic
vectors in this invention;

Figure 19 is a graph plotting the number of data of cluster computed by the conventional method
to the number of basic vectors;

Figure 20 is a graph showing a display form for a graphical user interface system of the
invention;

Figure 21 is a view showing a display form for the graphical user interface system of the
invention;

Figure 22 is a view showing a display form for the graphical user interface system of the
invention; and

Figure 23 is a view showing a display form for the graphical user interface system of the
invention.

DETAILED DESCRIPTION

The preferred embodiments of the invention will be described below with reference to the
accompanying drawings, but the invention is not limited to these embodiments.

A. Schematic method for data retrieval 

Figure 1 is a schematic flowchart showing a data retrieval method according to the present
invention. In the data retrieval method of the invention as shown in Figure 1, data is read from a
database at step S10. In this invention, the data is converted into a document keyword vector,
based on keywords and using a binary model, and stored in the database. The keywords may be
designated by the database administrator when generating the document keyword vector, or
registered by selecting keywords from the accumulated documents. The data may be a title or index
for the document data, as well as audio data and image data. Any well-known method for
digitizing the data based on the keyword may be used. For example, refer to Published
Unexamined Patent Application No. 2002-924268, Published Unexamined Patent Application
No. 2001-205183, Published Unexamined Patent Application No. 2001-324437, and Published
Unexamined Patent Application No. 2001-329613.
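
As an illustration of the binary model mentioned above, the following Python sketch builds a document keyword vector whose elements are 1 when a registered keyword occurs in the text and 0 otherwise. It is a minimal sketch under stated assumptions: the function name, the whitespace tokenization and the sample keyword list are hypothetical and are not taken from the cited applications.

```python
def binary_keyword_vector(text, keywords):
    """Return a 0/1 vector with one element per registered keyword.

    An element is 1 if the keyword occurs as a token of `text`
    (case-insensitive, whitespace tokenization), otherwise 0.
    """
    tokens = set(text.lower().split())
    return [1 if keyword.lower() in tokens else 0 for keyword in keywords]


# Hypothetical registered keywords and a sample document title.
registered = ["samurai", "suzuki", "japan", "database"]
vector = binary_keyword_vector("Suzuki opens a new plant in Japan", registered)
print(vector)
```

A practical system would normally apply stemming and phrase handling before matching; those steps are omitted here to keep the example short.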

Then, at step S12, the selective scaling of the invention is applied to the data read from the
database into an appropriate memory, to perform a process for calculating a covariance matrix.
At step S14, the calculated eigenvectors are made orthogonal to generate a set of estimated
basic vectors (k dimensions). The set of estimated basic vectors enhances the sensitivity to the
outlier cluster by contracting or enlarging a residual vector according to a contribution. At step
S16, the cluster classification (labeling) is performed using the generated estimated basic vectors
and the keywords for labeling registered in the memory. Then, the labeled cluster data is output
at step S18. In a specific embodiment of the invention, the cluster data may be output in a CSV
(Comma Separated Values) format, for example. Also, in this invention, the cluster data may be
output in any format in which character strings are separated by a separator such as a tab.
The output cluster data is passed to a graphical user interface system for displaying the cluster
data, whereby the cluster data is displayed or analyzed in this invention.
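
The separator-delimited output of step S18 can be sketched with the Python standard library as follows. The column names and the row layout are assumptions made for illustration; the actual layout of the output list is the one shown in Figure 13.

```python
import csv
import io


def format_cluster_rows(rows, delimiter=","):
    """Serialize (cluster_id, num_data, keywords) rows as delimited text.

    The three columns are hypothetical; pass delimiter="\t" for tab separation.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["cluster_id", "num_data", "label_keywords"])
    for cluster_id, num_data, keywords in rows:
        # Keywords are joined with spaces so they stay in a single field.
        writer.writerow([cluster_id, num_data, " ".join(keywords)])
    return buffer.getvalue()


print(format_cluster_rows([(1, 120, ["suzuki", "samurai"])]))
```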

Figure 2 is a flowchart showing another data retrieval method according to the invention.
The method shown in Figure 2 is suitably applied when the capability of the platform performing
the selective scaling is not sufficient to directly handle the documents held in the database.
This method accesses the database to read out the number of data registered in the database at
step S20. At step S22, a determination is made whether or not the number of data read is greater
than a preset threshold value, that is, whether the database is too large for the selective
scaling of the invention to be applied directly. If it is determined at step S22 that the number
exceeds the threshold value (i.e., the database is comparatively large) (yes), a predetermined
number of data is randomly sampled from the database at step S24 to generate a sample database,
which is held in an appropriate storage area of an internal memory or hard disk.

The data of the generated sample database is read at step S26. The selective scaling method is
applied to the read data at step S28. The basic vectors appropriate for retrieving the outlier
cluster are generated at step S30. The cluster retrieval (labeling) is performed at step S32,
using the generated basic vectors, and the cluster data is output at step S34. The output
cluster data is sent online or offline to the graphical user interface system of the invention, to
be used for analyzing the database or the clusters within the database.
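
The size check and random sampling of steps S20 to S24 can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that the predetermined sample size does not exceed the number of data in the database.

```python
import random


def select_working_set(data_ids, threshold, sample_size, seed=None):
    """Return the data IDs to which selective scaling will be applied.

    If the database holds no more than `threshold` data, all of it is
    used directly; otherwise `sample_size` data are drawn at random
    without replacement to form the sample database.
    """
    ids = list(data_ids)
    if len(ids) <= threshold:
        return ids
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(ids, sample_size)


sample = select_working_set(range(1_000_000), threshold=100_000,
                            sample_size=10_000, seed=0)
print(len(sample))
```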

B. Generating the basic vectors for data retrieval: selective scaling method 

Figure 3 is a schematic flowchart showing the selective scaling process at step S28 of Figure 2
according to the invention. The selective scaling process of Figure 3 is applied to the data of a
sample database A' after it has been generated. In the selective scaling process of the invention
as shown in Figure 3, at step S40, the number k of basic vectors to be generated, a threshold
value 1 for the magnitude of the inner product of basic vector and contribution vector, and an
offset value m set to increase or decrease the norm are registered in the memory to set up the
expression. At step S42, a covariance matrix is calculated for the data in the sample database.
At step S44, the obtained covariance matrix is subjected to eigenvalue decomposition to obtain the
maximum eigenvalue and the eigenvector corresponding to the maximum eigenvalue. Thereafter, at
step S46, the eigenvectors are made orthogonal by the Modified Gram-Schmidt method to enhance the
retrieval precision.

At step S50, a contribution vector R_s[i] is calculated. The contribution vector is calculated as
the ratio of the inner product, summed over the elements j, of the basic vector being computed at
that time and the data to the norm of the document, as will be detailed later. Subsequently, a
residual vector is calculated at step S52. At step S54, a contraction factor w is calculated using
the computed contribution vector. Using this contraction factor, the j-th element of the i-th
document of the residual vector is scaled. Then, the procedure returns to step S42 to calculate
the covariance matrix of the residual matrix. When the designated k-th basic vector has been
calculated, the procedure ends.
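
Steps S42 to S46 can be sketched with NumPy as follows. This is an illustrative sketch rather than the pseudo code of Figures 4 and 5: the function names are hypothetical, the data is assumed to be a dense array with one document per row, and previously accepted basic vectors are assumed to have unit length.

```python
import numpy as np


def top_eigenvector(data):
    """Steps S42 and S44: covariance matrix of the rows of `data` and the
    eigenvector belonging to its largest eigenvalue."""
    cov = np.cov(data, rowvar=False)        # keywords are the variables
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvals[-1], eigvecs[:, -1]


def modified_gram_schmidt(vector, basis):
    """Step S46: orthogonalize `vector` against the unit vectors in `basis`.

    In the modified variant each projection is removed from the vector as
    already updated, which is numerically more stable than the classical one.
    """
    v = np.array(vector, dtype=float)
    for b in basis:
        v -= np.dot(b, v) * b
    return v / np.linalg.norm(v)  # assumes v is not (numerically) zero


data = np.array([[1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0]])
largest, b1 = top_eigenvector(data)
b2 = modified_gram_schmidt([1.0, 1.0, 0.0], [b1])
print(abs(np.dot(b1, b2)) < 1e-9)
```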

Figures 4 and 5 show pseudo code for performing the selective scaling method shown in Figure 3.
As shown in Figure 4, the pseudo code for performing the selective scaling method of the invention
first declares each of the input values and variables used to estimate basic vectors that are
highly sensitive to the outlier cluster. Thereafter, because the basic vector must first be
estimated in the first loop, the calculation of the contraction/enlargement factor w is skipped,
and the covariance matrix and the contribution vector are calculated as shown in Figure 5. In the
process of the invention, first of all, several percent of the documents held in the database are
extracted randomly to generate a sample database A', a covariance matrix is generated for the
sample database A', and the eigenvector b_p corresponding to the maximum eigenvalue is obtained by
eigenvalue decomposition, as shown in Figure 5. The method used in this invention of generating
the covariance matrix, performing eigenvalue decomposition and generating the basic vectors is
referred to as the COV method. Then, the obtained eigenvectors are made orthogonal by applying
the Modified Gram-Schmidt method to estimate the k basic vectors.

In the subsequent loop, the vectors R_m and R_s are defined to compute a contribution vector. The
contribution vector is typically defined by the following expression, as listed in Figure 5.

[Formula 1]

\[ R_s = \{ R_s[i] : i = 1, \ldots, M' \}, \qquad R_s[i] = \frac{\sum_{j} R_i[j]\, b_p[j]}{\sqrt{\sum_{j} R_i[j]\, R_i[j]}} \]

In the above expression, r_i = {R_i[j] : j = 1, ..., N} denotes the i-th document in the sample
database, and the inner product is divided by the square root of element × element, summed over
the elements, for the purpose of normalization.

Then, a residual vector r_i' is calculated in the next loop. This residual vector r_i' is defined
by the following expression.

[Formula 2]

\[ r_i' = r_i - R_m[i] \times b_p \]

The residual vector r_i' corresponds to the vector obtained by subtracting, from the previously
obtained residual vector, the component in the direction of the basic vector calculated at that
time. That is, the basic vector used at that time and the newly generated residual vector are
orthogonal to each other. This relationship is shown in Figure 6. As shown in Figure 6, the
residual vector r_i, whether newly calculated or acquired in the previous loop, is not usually
orthogonal to the basic vector b_p used for calculation at that time. On the other hand, because
the basic vector b_p is selected from the larger eigenvalue obtained in the immediately preceding
eigenvalue calculation, the basic vector b_p is directed toward the main cluster, and is not
suitable as a basic vector for retrieving the outlier cluster. In this invention, to improve the
retrievability of the outlier cluster, the COV method must be applied to the residual vector from
which the component of the eigenvector corresponding to the larger eigenvalue has been subtracted.
The newly generated residual vector r_i' is made orthogonal to the basic vector b_p used at that
time, as shown in Figures 6A and 6B. Figures 6A and 6B show the generation of the residual vector
for data that may lie in either a positive or a negative direction with respect to the basic
vector, which is a feature of the COV method used in this invention. In either case, the newly
generated residual vector r_i' is orthogonal to the basic vector b_p.
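
The contribution and residual calculations can be checked numerically with the following NumPy sketch. It rests on two assumptions that are not stated explicitly in the text: R_m[i] is taken to be the plain inner product of r_i and b_p, and b_p is taken to have unit length, which is what makes the new residual orthogonal to b_p. The function names are hypothetical.

```python
import numpy as np


def contributions(residuals, b_p):
    """Return (R_m, R_s) for every row r_i of `residuals`.

    R_m[i] is assumed to be the inner product r_i . b_p, and R_s[i] is
    that inner product divided by the norm of r_i (zero for a zero row).
    """
    residuals = np.asarray(residuals, dtype=float)
    b_p = np.asarray(b_p, dtype=float)
    r_m = residuals @ b_p
    norms = np.linalg.norm(residuals, axis=1)
    r_s = np.divide(r_m, norms, out=np.zeros_like(r_m), where=norms > 0)
    return r_m, r_s


def subtract_component(residuals, b_p, r_m):
    """New residuals r_i' = r_i - R_m[i] * b_p, computed for every row."""
    return np.asarray(residuals, dtype=float) - np.outer(r_m, b_p)


residuals = np.array([[3.0, 4.0], [0.0, 2.0], [0.0, 0.0]])
b_p = np.array([1.0, 0.0])  # unit-length basic vector
r_m, r_s = contributions(residuals, b_p)
new_residuals = subtract_component(residuals, b_p, r_m)
print(np.allclose(new_residuals @ b_p, 0.0))  # orthogonality check
```

The orthogonality follows from r_i' . b_p = r_i . b_p - R_m[i] (b_p . b_p), which vanishes when b_p . b_p = 1.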

Thereafter, the process returns to the pseudo code of Figure 4, where the contraction/enlargement
factor w is calculated using the contribution vector R_s generated by the pseudo code shown in
Figure 5. The contribution vector R_s is normalized, and a document contributes more to the basic
vector as its R_s is closer to 1. Therefore, the contraction/enlargement factor w is calculated so
that, at the next calculation of the residual vector, a residual vector having a greater
contribution is contracted and a residual vector having a smaller contribution is enlarged. The
residual vector with contracted or enlarged norm is used in the loop for generating the covariance
matrix again, and basic vectors are calculated until i = k.
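
The contraction and enlargement step can be illustrated as follows. The text above does not give the formula for w, so the rule used in this sketch is purely an assumption chosen to show the idea: rows whose absolute contribution exceeds a threshold are contracted by 1/(1 + offset) and all other rows are enlarged by (1 + offset). The actual rule is the one given in the pseudo code of Figure 4.

```python
import numpy as np


def scale_residuals(residuals, r_s, threshold, offset):
    """Contract strongly contributing rows and enlarge the others.

    ASSUMED rule, for illustration only:
      |R_s[i]| >  threshold  ->  w = 1 / (1 + offset)   (contract)
      |R_s[i]| <= threshold  ->  w = 1 + offset         (enlarge)
    """
    residuals = np.asarray(residuals, dtype=float)
    strong = np.abs(np.asarray(r_s, dtype=float)) > threshold
    w = np.where(strong, 1.0 / (1.0 + offset), 1.0 + offset)
    return residuals * w[:, None], w


scaled, w = scale_residuals([[2.0, 0.0], [0.0, 2.0]],
                            r_s=[0.9, 0.1], threshold=0.5, offset=1.0)
print(w)
```

Feeding the scaled residuals back into the covariance calculation lowers the weight of documents already explained by the current basic vector, so that later basic vectors are more likely to point toward outlier clusters.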

C. Cluster classification/labeling 

C-1: Similarity calculation and keyword selection for labeling

In this invention, the clusters are automatically labeled using the basic vectors generated by the
method explained in section A, and the labeled data is classified into the clusters. Figure 7 is
a schematic flowchart of the cluster classifying and labeling process of the invention. At step
S60, the process of Figure 7 reads the basic vector data, the keyword data, and the data held in
the database or in the sample database generated in section A, all of which are stored in an
appropriate memory. Then, at step S62, the basic vector is initialized as I = I_0 (here, I_0 is
the identifier of the basic vector to start with). At step S64, the threshold value d, input
beforehand by the user, held in the memory and applied to the similarity, and the stop words are
read from the memory.

At step S66, the similarity between the first basic vector and the data read from the database is
calculated and stored in the memory. This similarity is calculated as a simple inner product, or
as a normalized inner product. At step S68, among the elements contained in the document, the p
keywords having the greatest numerical values are selected in descending order from the keyword
table, and a pair of data identifier and similarity is stored in the memory. A keyword having a
greater numerical value in the document is defined as a document intrinsic keyword. At step S68,
in the calculation of similarity, the keywords contributing significantly to the similarity with
the basic vector are read from the keyword list, and p keywords are selected in descending order
and stored in the memory. The keywords selected in this manner represent the similarity, and are
defined as labeling keywords (extrinsic keywords) in this invention.
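
The similarity calculation and the two kinds of keyword selection can be sketched as follows. The function name and arguments are hypothetical. Ranking the labeling (extrinsic) keywords by the absolute value of each term of the inner product is an assumption made here so that documents with negative similarity are handled symmetrically.

```python
import numpy as np


def similarity_and_keywords(doc, basis, keywords, p=3, normalize=True):
    """Return (similarity, extrinsic keywords, intrinsic keywords).

    intrinsic: the p keywords with the largest values in the document.
    extrinsic: the p keywords whose terms doc[j] * basis[j] contribute
               most (in absolute value, an assumption) to the inner product.
    """
    doc = np.asarray(doc, dtype=float)
    basis = np.asarray(basis, dtype=float)
    terms = doc * basis                       # per-keyword contribution
    similarity = float(terms.sum())
    if normalize:
        denom = np.linalg.norm(doc) * np.linalg.norm(basis)
        similarity = similarity / denom if denom > 0 else 0.0
    intrinsic = [keywords[j] for j in np.argsort(-doc, kind="stable")[:p]]
    extrinsic = [keywords[j]
                 for j in np.argsort(-np.abs(terms), kind="stable")[:p]]
    return similarity, extrinsic, intrinsic


words = ["suzuki", "samurai", "japan", "car"]
sim, ext, intr = similarity_and_keywords([1, 2, 3, 0], [0.9, 0.4, 0.1, 0.2],
                                         words, p=3, normalize=False)
print(sim, ext, intr)
```

In this example the document vector ranks japan highest, while the inner product with the basic vector is dominated by suzuki, so the two keyword lists differ in order.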

At step S70, the similarities stored in the memory are sorted in descending order. Since the
covariance matrix is used in this invention, both positive and negative similarities are
generated, so that documents having high similarity occur at both the positive and negative ends.
At step S72, the absolute value of the similarity in the positive and negative directions is
compared with the threshold value d read from the memory. For each document identifier whose
absolute similarity exceeds the threshold value, the labeling keywords and the document intrinsic
keywords are added and registered as a pair in the memory. At step S72, a keyword registered as a
stop word is not added.
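
Steps S70 and S72 can be sketched as follows. The record layout (data ID, similarity, labeling keywords, intrinsic keywords) and the function name are hypothetical and chosen only for illustration.

```python
def select_labeled_documents(records, threshold, stop_words):
    """Keep documents whose absolute similarity exceeds `threshold`.

    records: iterable of (data_id, similarity, extrinsic, intrinsic).
    Returns a dict data_id -> (similarity, extrinsic, intrinsic) with
    stop words removed from both keyword lists.
    """
    stop = set(stop_words)
    # Step S70: sort by similarity in descending order.
    ordered = sorted(records, key=lambda record: record[1], reverse=True)
    selected = {}
    for data_id, similarity, extrinsic, intrinsic in ordered:
        # Step S72: both strongly positive and strongly negative pass.
        if abs(similarity) > threshold:
            selected[data_id] = (
                similarity,
                [k for k in extrinsic if k not in stop],
                [k for k in intrinsic if k not in stop],
            )
    return selected


records = [
    (1, 0.92, ["suzuki", "samurai"], ["japan", "the"]),
    (2, 0.10, ["car"], ["car"]),
    (3, -0.85, ["the", "japan"], ["japan"]),
]
print(select_labeled_documents(records, threshold=0.5, stop_words=["the"]))
```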

Figure 8 shows a list of the similarities generated through the above process, together with the
data registered as labeling keywords and document intrinsic keywords. In the form of Figure 8,
the similarity is calculated from the inner product (iP) of the data and the basic vector
generated from the largest eigenvalue, and the largest value and the smallest value are indicated
in the third and fourth lines. In the sixth to tenth lines, the document ID, the inner product of
the document, the labeling keywords (extrinsic) and the document intrinsic keywords are listed. In
the form shown in Figure 8, p = 3 is assumed; the keyword having the greatest contribution to the
inner product is suzuki, followed by samurai, and the keyword having the smallest contribution to
the inner product is Japan. The keywords have a histogram of contributions to the inner product,
and are held in order of decreasing contribution and listed in the eighth to tenth lines.

C-2: Preprocessing for cluster classification

In the processing for cluster classification, five data structures, doclist, keyDoclist, keyTable,
resultId, and resultLabel, are generated. The data structures generated to perform the cluster
processing in this invention are shown in Figure 9.

15 

The doclist, keyDoclist and keyTable shown in Figure 9 are created by reading the data
generated in section C-1 and stored in appropriate storage means such as a hard disk.
Specifically, for each data ID, doclist holds a keyword list of the labeling keywords and the
document intrinsic keywords obtained in section C-1, together with the similarity of the data.
Conversely, for each keyword, keyDoclist holds a list of the data IDs of the data containing the
keyword as a labeling keyword or a document intrinsic keyword. A sort index and a count, which
are used in generating the labeling, are also stored. The keyDoclist for each keyword is created
and registered in appropriate storage means. Thereafter, a hash table keyTable with the keyword
as a hash key is created, in which each keyDoclist is added as an element, and registered in the
storage means.
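
As a non-limiting illustration, the relationship among doclist, keyDoclist and keyTable may be sketched as follows. The class names, field names and sample values are assumptions chosen for this sketch; Figure 9 defines the actual layout.

```python
from dataclasses import dataclass, field


@dataclass
class DocEntry:
    """One element of doclist: a data ID, its similarity and its keyword list."""
    data_id: int
    similarity: float
    keywords: list
    marked: bool = False


@dataclass
class KeyDocEntry:
    """One keyDoclist: the data IDs containing a keyword, plus count and sort index."""
    keyword: str
    data_ids: list = field(default_factory=list)
    count: float = 0.0
    sort_index: int = -1
    marked: bool = False


def build_key_table(doclist):
    """Build keyTable, a hash table from keyword to its keyDoclist."""
    key_table = {}
    for doc in doclist:
        for kw in doc.keywords:
            entry = key_table.setdefault(kw, KeyDocEntry(kw))
            if doc.data_id not in entry.data_ids:
                entry.data_ids.append(doc.data_id)
    return key_table


# Illustrative values only.
doclist = [
    DocEntry(1, 0.9, ["suzuki", "samurai"]),
    DocEntry(2, -0.7, ["samurai"]),
]
key_table = build_key_table(doclist)
```

The hash table gives direct access from a keyword to the data IDs that carry it, which is the inverse of the mapping held in doclist.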

C-3: Labeling based on similarity

The cluster detection and labeling according to the invention are performed by reading the above
data and using the following process. First, the data IDs corresponding to the keyword of each
keyDoclist in the hash table are scanned. When the data ID being processed matches a data ID
contained in doclist, the count value is updated as

keyDoclist.count[keyword] += a × |doclist.similarity|

In the above expression, a denotes a weight and |·| denotes the absolute value of the similarity
of the data listed in doclist. This absolute value takes a value ranging from 0.0 to 1.0, and the
weight a has a default value of 10.0. Thereby, the similarity is extended to a range from 0.0 to
10.0. Through this process, a value that depends on the labeling keywords and the document
intrinsic keywords contained in the documents having greater similarity is set to the count value.

Then, the count value of keyDoclist is calculated for all the keywords with respect to the basic
vector being processed. Thereafter, the count values are sorted in descending order (with index).
Here, the index in sorting with index holds the position of each keyword before sorting, so that
the original data ID list can be referred back to from the keywords in order of decreasing count
value. Thereafter, the obtained index is registered as the sort index value of keyDoclist.
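
As a non-limiting illustration, the accumulation of the count value and the sorting with index may be sketched as follows. The function names and the simplified dictionaries standing in for doclist and keyTable are assumptions made for this sketch.

```python
def keyword_counts(doclist, key_table, a=10.0):
    """Accumulate count[keyword] += a * |similarity| over matching data IDs.

    doclist:   {data_id: signed similarity}   (simplified stand-in)
    key_table: {keyword: [data_id, ...]}      (simplified stand-in)
    """
    counts = {}
    for keyword, data_ids in key_table.items():
        counts[keyword] = sum(a * abs(doclist[i]) for i in data_ids if i in doclist)
    return counts


def sort_with_index(keywords, counts):
    """Return the pre-sort positions of the keywords, largest count first."""
    return sorted(range(len(keywords)), key=lambda i: counts[keywords[i]], reverse=True)


# Illustrative values only.
doclist = {1: 0.9, 2: -0.5, 3: 0.2}
key_table = {"japan": [3], "suzuki": [1, 2], "samurai": [1]}
keywords = list(key_table)
counts = keyword_counts(doclist, key_table)
order = sort_with_index(keywords, counts)
ranked = [keywords[i] for i in order]
```

Here `order` plays the role of the sort index: each entry is the position a keyword held before sorting, so the original keyDoclist can still be looked up after ranking.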

Then, the variables resultId and resultLabel are initialized, and the following procedure is
repeated for the sorted keywords in order of decreasing count value, so that the cluster
detection and labeling are performed simultaneously. When the repetition is complete, the
procedure is ended.

The labeling of a cluster using the count value will be described below. First of all, the count
value is examined. If the count value is lower than or equal to a predetermined set value of, e.g.,
2.0 (the keyword is not suitable as a cluster label), the labeling of the cluster is ended. If the
keyword of interest is marked, the keyword is skipped because it has already been dealt with, and
the determination is made for the next keyword. For an unmarked keyword, a data ID list is read
from the keyDoclist containing the keyword of interest. The number of keywords contained in
resultLabel is then checked. If the number of keywords does not exceed a predetermined number
p, the keyword is added to resultLabel; otherwise, the mark of the keyDoclist of the keyword of
interest is set to true. Thereafter, the following process is performed for each of the data IDs
contained in the obtained data ID list.

It is checked whether or not the data ID of interest is already contained in resultId. If it is,
that data ID is skipped and the procedure goes to the next data ID. If not, that data ID is added
to resultId. Then, doclist is scanned in its entirety. If the same data ID is found, the doclist
entry is marked and its whole keyword list is obtained. Any keyword contained in this keyword
list and not marked in keyDoclist is selected as a label candidate, added to resultLabel by
insertion sorting, and registered in the appropriate memory. Here, insertion sorting means
comparing the sort index value obtained in sorting with each of the sort index values of all the
keywords contained in the present resultLabel, and keeping in resultLabel the keywords having
smaller index (larger count value). Thus, the merging process of keywords is ended.

Thereafter, resultLabel obtained through this process is output as the label of the cluster, and
resultId or its total number is output at the same time. At this time, the type of the cluster is
decided as major, outlier or noise. Because another cluster may exist for the basic vector of
interest, it is checked whether or not any element remains in keyDoclist, and the above process
is repeated until no unchecked elements remain in keyDoclist.

C-4: Cluster classification 

Figures 10 to 12 list a pseudo code for performing classification of the cluster. In Figures 10 to
12, random sampling is made from the database to generate a sample database A'. The
definition of each variable appears in Figure 10. Since the clusters are labeled through the above
process, the number of pieces of data contained in each cluster is easily counted. Thus, in a
specific embodiment of the invention, classification of the clusters is made in the following way.
M denotes the number of data contained in the original database when data is not randomly
sampled from the database, or the number of data M' contained in the sample database when
data is randomly sampled.

In the procedure as shown in Figures 10 to 12, supposing that N is the number of data contained 
in each labeled cluster, the clusters satisfying the following expressions can be decided as the 
major cluster, the noise, and the outlier cluster in the specific embodiment of the invention. 



[Formula 3]

if (N ≥ M × c/100) output ("Major cluster");
else if (N ≤ M × b/100) output ("Noise");
else output ("Outlier (Minor) Cluster");



In the above expressions, b and c are appropriately set constants, with b < c, that correspond
to the kinds of cluster. The values of b and c depend on the data to be dealt with. For
example, b = 1.0 and c = 3.0 in the following example. That is, a cluster is noise if its number
of data is 1% or less of the sample database, a major cluster if it is 3% or more, and an outlier
cluster if it is between 1% and 3%. In another example, b = 0.1 and c = 1.0 are supposed,
because the news data covers a wide variety of areas including politics, economy, culture,
international situation, entertainment, health, sports and affairs.
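
As a non-limiting illustration, the classification by cluster size may be sketched as follows, using the percentages stated above (3% or more for a major cluster, 1% or less for noise). The function name and the sample sizes are assumptions made for this sketch.

```python
def classify_cluster(n, m, b=1.0, c=3.0):
    """Classify a cluster of n data against a database (or sample) of m data.

    b and c are percentages with b < c, as described in the text.
    """
    if n >= m * c / 100.0:
        return "Major cluster"
    if n <= m * b / 100.0:
        return "Noise"
    return "Outlier (Minor) Cluster"


# Illustrative sizes for a sample database of 10,000 data.
labels = [classify_cluster(n, 10000) for n in (400, 200, 100)]
```

With b = 1.0 and c = 3.0, the boundaries for 10,000 data fall at 100 and 300 data, so the three sample sizes land in the three different classes.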

D. Graphical user interface for visualizing the cluster 

The clusters, classification and labeling generated through the process described in sections B
and C are output in a CSV format for the basic vector obtained in section A or the basic vector
obtained by the COV method in the specific embodiment of the invention. Figure 13 shows an
output form of the labeled cluster data generated through the process described in sections B
and C, which is configured in the CSV format. In Figure 13, the first line lists the fields: basic
vector ID, labeling of cluster, number of data, percentage of the cluster in the entire database,
and cluster class (type). This section D describes a graphical user interface for visualizing the
cluster data by inputting the data in the above format.
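
As a non-limiting illustration, reading cluster data of this kind may be sketched as follows. The column order, the field names and the sample rows are assumptions made for this sketch; the actual layout is the one shown in Figure 13.

```python
import csv
import io

# Hypothetical field names, following the order of fields named in the text.
FIELDS = ["vector_id", "label", "num_docs", "percent", "cluster_type"]


def read_clusters(text):
    """Parse CSV cluster data whose first line is a header naming the fields."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header line
    clusters = []
    for row in rows:
        rec = dict(zip(FIELDS, row))
        rec["num_docs"] = int(rec["num_docs"])
        rec["percent"] = float(rec["percent"])
        clusters.append(rec)
    return clusters


# Illustrative data only; not taken from Figure 13.
sample = (
    "basic vector ID,label,number of data,percentage,type\n"
    "+1,suzuki samurai japan,400,4.0,Major\n"
    "-3,river flood,150,1.5,Outlier\n"
)
clusters = read_clusters(sample)
total_docs = sum(c["num_docs"] for c in clusters)
```

The total number of documents is accumulated while reading, since the angle assigned to each cluster later depends on its share of that total.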

Figure 14 is a flowchart showing a visualizing method for use with the graphical user interface
in this invention. At step S80, the visualizing method of the invention reads the cluster data
stored in the CSV format shown in Figure 13 from a storage device such as a memory, hard disk,
flexible disk, magneto-optical disk, magnetic tape, compact disk (CD), or digital versatile disk
(DVD). At step S82, the occupancy area and angle of the region representing each cluster in the
spiral rendering are decided in accordance with the number of documents contained in the
cluster. At step S84, the (multi-dimensional) cluster data obtained from a multi-dimensional
basic vector is mapped onto a two-dimensional spiral and rendered.

At step S86, a display mode intrinsic to each cluster is automatically assigned to the segments
constituting the spiral rendering. At step S88, the display data for displaying the clusters is
generated and registered. At step S90, the registered display data is read and displayed. The
display mode involves the color, shading, density and pattern in the specific embodiment of the
invention. The rendering described here uses the shading and the color tone, but various other
rendering methods may be used in this invention.

A reason for displaying the clusters in a spiral is that, unlike in a histogram, the display width
of each cluster is not squeezed by the display space to the point of degrading recognition, even
if the number of clusters is increased. In consideration of the number of documents contained in
each cluster, from the major clusters to the outlier clusters, the major clusters are arranged on
the outside of the spiral and the outlier clusters on the inside, whereby the data is naturally
displayed in correspondence with the number of documents contained. Thus, the configuration of
the clusters within the database can be intuitively presented to the user.

The spiral rendering at step S84 in the visualizing method of the invention will be described
below. In the rendering at step S84, the spiral is defined by the following parametric expression.

[Formula 4]

$$x(t) = \left(\frac{t}{2\pi m}\right)^{k} \cos t, \qquad y(t) = \left(\frac{t}{2\pi m}\right)^{k} \sin t$$

In the above expression, k is a positive real number, and t is an angle assigned to the
cluster area constituting the spiral. For k = 1, the curve is "Archimedes' spiral" (Clifford A.
Pickover, "Computers, Pattern, Chaos, and Beauty", St. Martin's Press, New York, 1990). As k
becomes larger, the spiral becomes closer to a "sea shell" (logarithmic spiral), but the above
expression is only a locus of the spiral and contains no concept of area.
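
As a non-limiting illustration, points on the spiral locus may be computed as follows. This sketch assumes, following the pseudo code given later in this section, that the radius is normalized by 2π times the number of turns of the spiral; the function names and the parameter name `turns` are assumptions made for this sketch.

```python
import math


def spiral_point(t, turns, k=1.0):
    """Return (x, y) on the spiral at angle t, with radius (t / (2*pi*turns))**k."""
    r = (t / (2.0 * math.pi * turns)) ** k
    return r * math.cos(t), r * math.sin(t)


def spiral_points(turns, segments, k=1.0):
    """Sample the spiral at segments + 1 equally spaced angles from 0 to 2*pi*turns."""
    delta = 2.0 * math.pi * turns / segments
    return [spiral_point(i * delta, turns, k) for i in range(segments + 1)]


# Three turns sampled at 360 steps; k = 1 gives Archimedes' spiral.
pts = spiral_points(turns=3, segments=360, k=1.0)
radii = [math.hypot(x, y) for x, y in pts]
```

With this normalization the radius grows from 0 at the center to 1 at the outer end, regardless of the number of turns, and a larger k concentrates the inner turns near the center.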

In this invention, at step S86, to enable display using the area, the present inventors have defined
a "layer" every time the spiral goes around once, that is, every time the angle t becomes a
multiple of 2π, and introduced the concept of area by connecting the corresponding points (with
the same angular period) of an outer layer and the immediately inner layer and defining a region
surrounded by the outer layer, the inner layer and the lines connecting the corresponding points.
Next, one circumference is divided into m, and the angle θ(c) assigned to a cluster c in
accordance with the number of documents contained in the cluster c is given by the following
expression.

[Formula 5]

$$\theta(c) = 2\pi m \, \frac{d(c)}{D}$$

where d(c) is the number of documents contained in the cluster c of interest, and D is the total
number of documents. In the specific embodiment of the invention, one circumference is divided
into m, and one piece is defined as a "segment", whereby the number of segments for the cluster
of interest is decided by the following expression.

[Formula 6]

$$\mathrm{segment} = \mathrm{ceil}(\theta(c))$$

In the above expression, ceil() denotes the numerical value in parentheses rounded up to an
integer.

Moreover, the rendering is performed in accordance with the display mode designated for the
obtained segments. In this specific embodiment of the invention, color is used as the display
mode, and to enhance the visibility, the rendering is performed with lower density in the central
part of each layer and higher density at the layer boundaries, so that with this simple shading
the spiral appears to swell up and is recognized like a "shell".

In another embodiment of the visualizing method according to the invention, the spiral is
subjected to contraction or enlargement, parallel translation and rotation in the spirally
rendered state, or the label attached to a cluster is displayed interactively in a pop-up by
pointing to and selecting the cluster. In another embodiment of the invention, the user selects a
plurality of (e.g., three) clusters in the spirally rendered state, and the documents contained in
each cluster are projected three-dimensionally, using the basic vectors corresponding to the
selected clusters as the Cartesian coordinates for display. In the three-dimensional display mode
of the invention, the clusters are subjected to contraction and enlargement, parallel translation
and rotation in three dimensions, and a document is selected.

The pseudo code for executing the GUI for spiral rendering according to the invention is as
follows. This pseudo code takes as inputs a file name in CSV format, a spiral pitch number, the
total number of segments in the spiral, and a scaling factor for the number of divisions of one
circumference, which determines how much angle is given according to the percentage of each
cluster. In this invention, these input values are used to perform the spiral rendering, in which
the area of each cluster is apportioned at a circumferential angle of the spiral corresponding to
the size of the cluster on a display window, so that the distribution from the major clusters to
the outlier clusters can be grasped well. After the end of the spiral rendering, the GUI is ended
by clicking a close button on the display window with pointer means such as a mouse.

GUI(filename, scale, pitch, segments) {
    filename: String;      // CSV filename
    scale: Integer;        // scale factor for spiral angle
    pitch: Integer;        // pitch number of spiral
    segments: Integer;     // total number of segments
    totalDocs: Integer;    // total number of documents
    numClusters: Integer;  // number of clusters
    sign: Integer;         // sign: 1 (+1), -1 (-1)
    dimension: Integer;    // dimension of basic vector
    numDocs: Integer;      // number of documents contained in cluster
    label: String;         // label of cluster
    percent: Float;        // percentage of the cluster in total documents
    type: Integer;         // type of cluster (0: Major, 1: Outlier, 2: Noise)
    at: AffineTransform;   // holds the affine transformation (rotation, movement, enlargement/contraction)
    spr: array of Spiral;  // for holding spiral
    file: FILE;            // file handle

    // step 1: reading a CSV file
    file = open(filename);
    while (!file.EOF) {
        file.read(sign, dimension, numDocs, label, percent, type);
        totalDocs = totalDocs + numDocs;
        registerCluster(sign, dimension, numDocs, label, percent, type);
        numClusters = numClusters + 1;
    }

    // step 2: deciding the size and color of cluster
    for (int i = 1; i <= numClusters; i++) {
        Cluster c = (Cluster)getCluster(i);
        Integer size = (scale * c.getNumDocs() / totalDocs) + 1;
        Color color = randomColor();  // different colors for adjacent clusters
        c.setAttribute(size, color);
    }

    // step 3: drawing the cluster and GUI process
    spr = new Spiral[segments];
    while (true) {
        at = getAffineTransform();  // acquiring parameters for affine transformation
        drawSpiral(spr, at, pitch, segments);
        drawSelected(at);  // pop-up the label of selected cluster
        if (mouse.windowClosed) exit();  // exit is selected by mouse
    }
}

drawSpiral(spr, at, pitch, segments) {
    setTransform(at);  // for affine transformation
    Float delta = 2 * pitch * π / segments;
    Integer i0 = segments / 4;  // skip drawing for first 1/4 spiral
    Float t = i0 * delta;
    for (int i = 0; i < segments; i++, t += delta) {
        int j = segments - i - 1;
        int layer = pitch * j / segments;
        Float x = pow(t / (2 * pitch * π), 3) * cos(t);
        Float y = pow(t / (2 * pitch * π), 3) * sin(t);
        spr[j] = new Spiral(x, y, layer);
    }

    Integer clusterCnt = 0;
    Cluster c = (Cluster)getCluster(clusterCnt);
    Integer numPolygons = c.getSize();
    Integer oneLayer = segments / pitch;
    for (int j = 1, k = 0; j < segments; j++) {
        while (numPolygons > 0) {
            Vector2D x0 = map2D(spr[k]);
            Vector2D x1 = map2D(spr[k + 1]);
            Vector2D x2 = map2D(spr[k + oneLayer + 1]);
            Vector2D x3 = map2D(spr[k + oneLayer]);
            Color color = c.getColor();
            fillPolygonWithGradientPaint(x0, x1, x2, x3, color);
            numPolygons--;
            k = k + 1;
            j = j + 1;
        }
        clusterCnt++;
        c = (Cluster)getCluster(clusterCnt);
        numPolygons = c.getSize();
    }
}

E. Data retrieval system 

Figure 15 is a schematic functional block diagram of a data retrieval system according to the
invention. The data retrieval system 10 of the invention shown in Figure 15 comprises a
basic vector generating part 14 for reading data stored in a database 12 and generating the basic
vector that can be appropriately used to retrieve the outlier cluster by applying the selective
scaling method, a classifying/labeling part 16 for performing the cluster classification and
labeling using the generated basic vector, and an output part 18 for outputting the generated
cluster classification and labeling in a predetermined data format. In a preferred embodiment of
the invention shown in Figure 15, the basic vector generating part 14, the classifying/labeling
part 16 and the output part 18 are configured in a single computer apparatus 20 placed at a
single site.

The computer apparatus 20 may be a personal computer, a workstation, or a general-purpose
large computer. In another embodiment of the invention, the basic vector generating part 14 for
performing the selective scaling method, which imposes the greatest load on the central
processing unit (CPU), is configured on a general-purpose large computer or dedicated
workstation having a lower processing speed, whereby the basic vector is calculated at night
when users do not frequently make access.

The database 12 is directly connected to the computer apparatus 20 in the embodiment shown
in Figure 15. The computer apparatus 20 shown in Figure 15 may handle temporary data
which is added to or deleted from the database 12, or the computer apparatus 20 may
comprise a document vector generating part for digitizing the data, generating and registering
the keywords, and generating the document keyword vector.

The basic vector generating part 14 further comprises a selective scaling part 22 for performing
the selective scaling and a COV processing part 24 for generating the basic vector by applying
the COV method. The selective scaling part 22 performs the selective scaling, such as
calculating the residual vector and calculating the contribution vector, and the COV processing
part 24 estimates the basic vector by applying the COV method to a certain matrix. Also, the
basic vector generating part 14 further comprises a basic vector holding part 26 for holding the
basic vector generated by the selective scaling part 22, whereby the data of the basic vector is
passed to the classifying/labeling part 16.

The classifying/labeling part 16 reads the information of the basic vector generated by the basic
vector generating part 14 from the basic vector holding part 26, and reads the data from the
database 12 to classify and label the clusters, whereby the table of Figure 9 and data such as
resultLabel and resultId are stored in the storage means 28 such as a memory or hard disk.

The data retrieval system of this invention receives a retrieval instruction via a network, not
shown, retrieves the clusters in response to the instruction, and sends the result to the network
in a data format. Moreover, in this invention, the clusters are displayed using the spiral
rendering, and the result is displayed as a Web page on the computer apparatus installed at the
site issuing the retrieval instruction.

The output part 18 comprises an output data generating part 30 and a display device 32 for
displaying the data generated by the output data generating part 30. The output data generating
part 30 reads the data stored in the memory 28, processes the data, and outputs the data in the
CSV format shown in Figure 13. Also, the display device 32 reads the data in CSV format using
the graphical user interface system of the invention, executes the process as described in section
C, and displays the clusters on the display screen using the spiral rendering.

In another embodiment of the invention, the output part 18 comprises the output data generating
part 30 and an external storage medium control part, not shown. For example, the output part 18
outputs data in a predetermined format to a storage medium such as a flexible disk, a hard disk,
a magneto-optical disk, a magnetic tape, a compact disk (CD), or a digital versatile disk (DVD)
under the control of the external storage medium control part. In this case, the data stored in the
storage medium is read into a computer apparatus having the graphical user interface of this
invention, and displayed using the spiral rendering.

Figure 16 is a block diagram showing the data retrieval system according to another embodiment
of the invention. In the data retrieval system shown in Figure 16, the basic vector generating
part 14 comprises a random sampling part 34 for performing random sampling after judging the
amount of data against the processing capability of the computer apparatus 20, and a
sample database 36 for storing the sample data generated by random sampling. The random
sampling part 34 reads out the number of data stored in the database 12 and compares it with a
preset threshold value; if the database contains more data than the threshold value, the random
sampling is performed. In that case, the random sampling part 34 randomly extracts data stored
in the database 12 and registers it in the sample database 36.

Also, the random sampling part 34 instructs the selective scaling part 22 to change the data
reading address from the database 12 to the sample database 36. On receiving this instruction,
the selective scaling part 22 reads the data from the sample database 36 once the configuration
of the sample database 36 is complete, and generates the basic vector by applying the selective
scaling method and the COV method. The data retrieval system shown in Figure 16 thus allows
the processing to be selected depending on the processing ability of the computer apparatus used.
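
As a non-limiting illustration, the decision made by the random sampling part 34 may be sketched as follows. The function name, the parameters and the sizes used are assumptions made for this sketch.

```python
import random


def choose_source(database, threshold, sample_size, seed=None):
    """Return (data, sampled): a random sample if the database exceeds the threshold.

    When the database holds no more data than the threshold, it is used as is.
    """
    if len(database) <= threshold:
        return database, False
    rng = random.Random(seed)
    # Sampling without replacement, so no datum is registered twice.
    return rng.sample(database, sample_size), True


# Illustrative sizes only.
db = list(range(1000))
sample_db, sampled = choose_source(db, threshold=500, sample_size=100, seed=0)
full_db, sampled_small = choose_source(db, threshold=2000, sample_size=100)
```

The returned flag corresponds to the instruction that switches the data reading address from the database to the sample database.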

This invention will be described below by way of example, but the invention is not limited
to the examples.

EXAMPLES 

Configuration of test database 



A test database having 2000 keywords and 100,000 data was created. The reason why the test
database was artificially configured is that when an already existing news database is used, the
existence of an outlier cluster is unknown, which makes it unsuitable for judging the data
retrieval of the invention. Figure 17 is a table showing the configuration of the created test
database. As shown in Figure 17, the data consists of major clusters, outlier clusters and noise,
in which the major clusters are 5 clusters with an existence ratio of 4%, the outlier clusters are
20 clusters with an existence ratio of 2%, and the noise is the other clusters with an existence
ratio of 1% or less.

Example 1

A sample database was created by randomly sampling 10,000 documents from the above
database while preserving each cluster at an existence ratio of 10%. For the data contained in
the sample database created in this manner, the number of keywords was reduced to 2000
dimensions, and the basic vector was reduced to 20 dimensions, whereby the outlier clusters were
classified and labeled. The threshold value d for cluster detection was 0.4, and the classification
of clusters was based on the following expression. It was supposed that b was equal to 1.0 and c
was equal to 3.0.

[Formula 7]

if (N ≥ M × c/100) output ("Major cluster");
else if (N ≤ M × b/100) output ("Noise");
else output ("Outlier (Minor) Cluster");

In the calculation, a personal computer equipped with a Pentium (registered trademark) 4
processor (1.7 GHz) made by Intel was used. Under the above conditions, cluster retrieval by the
selective scaling method of the invention was performed, and the major clusters and the outlier
clusters were detected at a precision of 100%. In the selective scaling method of the invention,
the duplication ratio (the percentage at which the same data is contained in a plurality of
clusters) was as small as that of the LSI method, which has the lowest ratio among the
conventional methods. From the viewpoint of avoiding repetitive calculation of data for the
basic vector, the selective scaling method was efficient.

Comparative example 

As comparative examples, the computation time, the detection ratio of major clusters, the
detection ratio of outlier clusters, and the duplication ratio at which the same cluster is
detected in a plurality of basic vectors were compared, using latent semantic indexing (LSI,
comparative example 1) and the ordinary COV method (comparative example 2) for the same
sample database. A smaller duplication ratio prevents duplicated computation and makes the
computation more efficient. The results are listed in Table 1, together with the result of
example 1.



Table 1

Example                        Detection ratio of     Detection ratio of       Duplication
                               major cluster (%)      outlier cluster (%)      ratio (%)

Selective Scaling              100                    100                      36
LSI (comparative example 1)    100                    50                       33
COV (comparative example 2)    100                    60                       76



Example 2 

In this example 2, the stability and precision of data retrieval by random sampling were examined
by changing the stop condition for classifying and labeling the clusters. The random sampling
was 1% of the test database, the number of basic vectors was 20, and a stop condition r = 3,
for stopping the cluster detection and labeling when no new cluster was found three times
consecutively, was included. The threshold value d was 0.5, and the labeling was p = 1. Figure
18 shows the results obtained for r = 1. In Figure 18, the clusters detected in the positive and
negative directions for a basic vector bi are shown with the ID of the basic vector. Also, the
contribution in Figure 18 indicates an average value of the inner product iP before normalization.
Among the results of Figure 18, a blank in the line corresponding to a basic vector indicates
that no cluster was detected in that direction of the basic vector. From Figure 18, it is found
that the major clusters and the outlier clusters are retrieved well in the positive and negative
directions of the basic vectors. Moreover, the same examination was made for the instances of
r = 2 and r = 3. As a result, it was revealed that the stop condition had little effect on the
results, and that cluster retrieval at sufficiently high precision was possible under the stop
condition of r = 1.

Examples 3, 4 and 5 

As a test database, LA Times news data containing 127,742 documents was used. In
example 3, whether or not the outlier clusters were detected, and the validity of the labeling,
were examined. For the document data contained in LA Times, the cluster detection and labeling
were performed in such a manner that the random sampling (1.35% (2000 data): examples 3, 4
and 5) was performed three times, whereby 64 basic vectors were generated to perform the
cluster retrieval and labeling. The cluster retrieval was made by calculating the inner product of
each basic vector and a document vector generated from the documents of the original LA Times
database, and sorting the calculated inner products in descending order. Then p labeling
keywords and p document intrinsic keywords were listed for as long as the absolute value of the
inner product was greater than the threshold value, in both the positive and negative directions.
In this example, when the same keyword is selected as both a labeling keyword and a document
intrinsic keyword, it is preferentially listed as the labeling keyword.
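
As a non-limiting illustration, the retrieval by inner product and sorting may be sketched as follows. The function name, the two-dimensional vectors and the threshold used are assumptions made for this sketch.

```python
def retrieve(basis, doc_vectors, d):
    """Rank documents by inner product with a basic vector.

    Returns (positive, negative): documents whose inner product is greater than d,
    and documents whose inner product is less than -d, each strongest first.
    """
    scored = [
        (doc_id, sum(b * x for b, x in zip(basis, vec)))
        for doc_id, vec in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending inner product
    positive = [(i, s) for i, s in scored if s > d]
    negative = [(i, s) for i, s in reversed(scored) if s < -d]
    return positive, negative


# Illustrative vectors only.
basis = [0.6, 0.8]
docs = {"a": [1.0, 0.0], "b": [0.0, -1.0], "c": [0.1, 0.1], "d": [0.0, 1.0]}
positive, negative = retrieve(basis, docs, d=0.5)
```

Reading the sorted list from both ends yields the clusters in the positive and negative directions of the same basic vector.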

For the obtained cluster classification and labeling, the basic vector was plotted along the
horizontal axis and the number of documents (size of cluster) along the vertical axis, whereby
the occupation percentage of outlier clusters was examined. Figure 19 shows the results. In
Figure 19, the position at which the basic vector corresponds to a cluster of about 3000 data is
indicated by the arrow.

Comparative example 3 

Using the basic vector generated by the conventional COV method, the cluster classification and 
labeling were performed, and data was plotted in the same manner as in the examples 3 to 5. 
Figure 20 shows the result. 

From a comparison between Figures 19 and 20, it will be found that even though the same
cluster classification and labeling are applied using the same number (64) of basic vectors, the
selective scaling method of the invention shown in Figure 19 yields a larger proportion of
outlier clusters relative to major clusters than the conventional COV method (comparative
example 3) shown in Figure 20. Whereas the graph obtained by the ordinary COV method is a
smooth hyperbola, when the selective scaling method of the invention is applied, the number of
major clusters decreases sharply to produce a steeply descending portion, with a relatively high
detection percentage of outlier clusters, in every instance of random sampling of examples 3
to 5. That is, outlier clusters are detected more efficiently than major clusters by applying the
selective scaling method.

In other words, although outlier clusters usually cannot be found unless the number of basic
vectors is increased, the selective scaling method of the invention detects the outlier clusters
(which may not be found by the COV method) very well, while the number of basic vectors,
namely, the dimension of the document keyword matrix, is kept smaller.

Example 6 

In this example 6, the display of data using the graphical user interface of the invention is 
specifically described. Cluster detection, classification and labeling were performed using the 
basic vectors obtained in example 1 to create a data file in CSV format. The CSV data file, 
stored in appropriate storage means, was read into the graphical user interface system of this 
invention. The clusters were thereby displayed in a spiral in which the major clusters, having 
a greater number of documents, were arranged in the outer layer, and the clusters having a 
smaller number of documents were arranged in the inner layer. Figure 21 is a view showing a 
display form of the graphical user interface system obtained by applying the spiral rendering to 
the CSV data file obtained in example 1 and selecting the largest major cluster with a pointer 
icon to display its label in a pop-up. As shown in Figure 21, with the graphical user interface 
of the invention, the clusters can be fully displayed for the user even when the number of 
clusters increases. 
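
The spiral arrangement described in example 6 can be illustrated by the following minimal sketch. It is an illustration under stated assumptions and not the implementation of the invention: the CSV column names "label" and "size", the sample rows, the use of an Archimedean spiral, and the function names read_clusters and spiral_layout are all hypothetical. The sketch only shows the ordering rule given above, in which clusters having more documents are placed on the outer part of the spiral and clusters having fewer documents on the inner part.

```python
import csv
import io
import math

# Hypothetical CSV layout: one row per cluster, with a label and a document count.
SAMPLE_CSV = """label,size
economy,420
sports,310
rare disease,12
meteor shower,5
"""


def read_clusters(csv_text):
    """Parse cluster rows from CSV text (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"label": row["label"], "size": int(row["size"])} for row in reader]


def spiral_layout(clusters, turns=3.0, max_radius=1.0):
    """Place clusters on an Archimedean spiral, largest cluster outermost."""
    ordered = sorted(clusters, key=lambda c: c["size"], reverse=True)
    count = len(ordered)
    placed = []
    for rank, cluster in enumerate(ordered):
        # rank 0 (largest cluster) gets t = 1, the outer end of the spiral;
        # smaller clusters get smaller t and therefore a smaller radius.
        t = (count - rank) / count
        radius = max_radius * t
        angle = 2.0 * math.pi * turns * t
        placed.append({
            "label": cluster["label"],
            "size": cluster["size"],
            "radius": radius,
            "x": radius * math.cos(angle),
            "y": radius * math.sin(angle),
        })
    return placed


layout = spiral_layout(read_clusters(SAMPLE_CSV))
for item in layout:
    print(f'{item["label"]:>14}  size={item["size"]:>4}  radius={item["radius"]:.2f}')
```

Because the radius depends only on the rank of a cluster by size, adding more clusters merely packs more points along the same spiral, which is consistent with the remark above that the clusters remain displayable as their number increases.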

Example 7 

Figure 22 is a view showing an enlarging process in the graphical user interface system of the 
invention. In Figure 22, the result of Figure 21 is enlarged, and the label is displayed by 
selecting another major cluster with a pointer icon. As shown in examples 6 and 7, the graphical 
user interface of the invention permits the user to intuitively judge the proportion of the 
database occupied by each cluster, from the major clusters to the outlier clusters, and to easily 
analyze the data contained in the database. 

Example 8 

Figure 23 is a view showing a cluster display obtained by applying the method described in 
Published Unexamined Patent Application No. 2001-329613. This method is used together with the 
spiral-shaped graphical user interface according to the invention. Whereas in Published 
Unexamined Patent Application No. 2001-329613 the document title is popped up, in the invention 
the cluster label is popped up when the cluster has been labeled. In this example 8, shown in 
Figure 23, the user confirms the label and designates it, whereupon the corresponding outlier 
cluster is displayed in the Cartesian coordinate system using the basic vector corresponding to 
the label. As shown in Figure 23, an outlier cluster in a large body of data can be clearly 
presented to the user by using the cluster detection, classification and labeling of the 
invention. 
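
One way to picture the Cartesian display of example 8 is as a projection of document vectors onto basic vectors used as coordinate axes. The following minimal sketch assumes exactly that; the text above does not specify the computation, so the choice of two axes, the inner-product projection, the sample vectors, and the function names dot, normalize and project are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm else list(v)


def project(documents, basis_x, basis_y):
    """Map each document vector to (x, y) coordinates on two basic vectors."""
    bx, by = normalize(basis_x), normalize(basis_y)
    return [(dot(d, bx), dot(d, by)) for d in documents]


# Hypothetical document vectors in a three-keyword space.
documents = [[2.0, 0.0, 0.0], [0.0, 3.0, 4.0], [1.0, 3.0, 4.0]]

# Hypothetical basic vectors: the first is assumed to correspond to the
# designated label, the second serves as the other display axis.
coords = project(documents, basis_x=[0.0, 3.0, 4.0], basis_y=[1.0, 0.0, 0.0])
for x, y in coords:
    print(f"x={x:.2f}  y={y:.2f}")
```

In this sketch, documents that lie along the basic vector assumed to correspond to the designated label receive a large x coordinate, so the members of that cluster separate from the rest of the data along that axis.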

The means or portions for implementing each of the functions of the invention are configured as 
software or as a group of software modules written in a computer programming language, and need 
not necessarily be configured as the functional blocks shown in the drawings. 

The above program for performing the information retrieval method of the invention is written 
in any of various programming languages, such as the C language, the C++ language, or Java 
(registered trademark), and the code of the program of the invention is stored in a computer 
readable storage medium such as a magnetic tape, a flexible disk, a hard disk, a compact disk 
(CD), a magneto-optical disk, or a digital versatile disk (DVD). 

As described above, with this invention, the basic vectors are generated efficiently with an 
enhanced ability to detect outlier clusters, and the labeling takes into account both the 
similarity and the keywords held by the data. As a result, the labels are meaningful with 
respect to both the similarity and the characteristics of the data, so that data analysis or 
retrieval for the documents in the database is made easy. That is, with this invention, it is 
possible to provide a data retrieval system, a data retrieval method, a program for causing a 
computer to execute a data retrieval, a computer readable storage medium storing the program, a 
graphical user interface system for displaying a retrieved document, a program executed on the 
computer to implement a graphical user interface, and a storage medium storing the program, in 
which a relatively small number of documents are efficiently retrieved from a large scale 
database comprising the documents, and displayed. 

While the invention has been described with respect to certain preferred embodiments and 
exemplifications, it is not intended to limit the scope of protection thereby, but solely by the 
claims appended hereto. 
